1  Foreword

This  “final  report”  actually  describes  a  project  still  in  progress,  having  recently  transitioned  from  Brown 
University  to  the  University  of  Texas  at  Austin  (which  Wallace,  the  PI,  joined  in  fall  2014).  The  project  has 
transitioned  smoothly.  Wallace  is  currently  working  with  one  PhD  student  (in  computational  linguistics)  and 
an  undergraduate  in  Computer  Science  on  furthering  the  methodological  research  for  this  project.  Moreover, 
David Beaver, Professor of Linguistics and Philosophy at UT Austin, has joined the project, replacing
Brown  linguist  Laura  Kertz.  Brown  Professor  in  Computer  Science  Eugene  Charniak  remains  on  the  project 
to  provide  guidance  on  natural  language  processing  methods. 

We also note that this project has recently spurred collaboration with researchers in Portugal, specifically
with Dr. Paula Carvalho [1] and her team. With these researchers, Wallace joined a proposal, entitled
“Expression and Recognition of Irony in Multicultural Social Media”, that was recently awarded to the team
through the UT Austin-Portugal Program (Scientific Research and Technological Development Project in
Interactive Digital Media and Emerging Technologies). This will provide an opportunity to further the
work and to disseminate the ideas and methods produced through the current work.

In  this  report,  however,  we  focus  mainly  on  the  progress  made  on  the  project  prior  to  its  transfer  to  UT 
Austin. 


2  Background 

The  research  objective  of  this  project  is  to  develop  resources  and  novel  computational  methods  to  advance 
automated  irony  detection  (i.e.,  identification  of  the  ironic  voice  in  online  content).  This  is  a  challenging  task 
because  the  meaning  of  natural  language  is  not  captured  by  words  and  syntax  alone.  Rather,  utterances 
(tweets,¹ sentences in forum posts, etc.) are embedded within a specific context. The ironic voice is an
important  example  of  this  phenomenon:  to  appreciate  a  speaker’s  intended  meaning,  it  is  crucial  to  first 
infer  if  he  or  she  is  being  ironic  or  sincere. 

Existing  computational  approaches  to  irony  detection  leverage  statistical  natural  language  processing 
(NLP)  and  machine  learning  (ML)  methods.  These  models  tend  to  be  relatively  ‘shallow’  in  that  they 
operate  only  over  simple,  unstructured  representations  of  data.  For  example,  in  the  case  of  natural  language 
(text), one might encode documents with word counts or functions thereof, and in the case of network-based
data (e.g., social networks) one might rely on analogously simple functions of link counts. Classification
would then be performed by algorithms operating over these encodings. But these simple representations
will often be insufficient to infer ironic intent [6]. In this project we therefore aim to explore novel approaches
to irony detection that are motivated by linguistic principles.

¹ ‘Tweets’ are short messages posted to the internet for the consumption of ‘followers’ via the web service Twitter.

To this end, we have brought together a diverse team comprising members with unique expertise. Senior
personnel on this project included PI Wallace and Charniak, both of whom have substantive expertise in
statistical natural language processing; we have combined this with important expertise from Kertz in linguistics
and from Trikalinos in statistics and experimental design. This interdisciplinary team has been a crucial
property of our approach: in our view, previous efforts to identify irony relied too heavily on standard
computer science methods, largely ignoring the perspectives of, e.g., linguists and cognitive scientists. Indeed,
as part of this broad effort to facilitate interdisciplinary communication around this important problem,
we organized and ran a workshop at CogSci 2014 this past July, which included speakers and
attendees from both computer and cognitive science (https://sites.google.com/a/brown.edu/irony/) [8].
We now continue this interdisciplinarity at UT Austin by involving Professor David Beaver (from linguistics
and philosophy).

In  the  next  section  we  review  the  specific  objectives  of  this  project  and  then  discuss  our  progress  toward 
meeting  them  in  the  relevant  period. 

3  Specific  Objectives 

The specific objectives of this project were as follows:

1.  First,  to  collect  and  annotate  a  high-quality  corpus  to  facilitate  research  on  irony  detection.  Prior  to 
this  project,  no  such  high-quality  dataset  existed.  This  has  been  a  major  obstacle  to  progress  on 
automated  irony  detection. 

2.  Second, to analyze empirically when existing ML and NLP technologies fail to detect ironic intent. We
specifically proposed to assess quantitatively (using the collected dataset) whether context is necessary
to discern ironic intent (and how often this is the case).

3.  Finally, we aimed to develop a new approach to irony detection that instantiates sociolinguistic
conceptions of irony within a modern, probabilistic machine learning framework. The idea was that this
approach would be informed by theoretical sociolinguistic perspectives on irony (and thus likely capable
of discerning ironic utterances missed by existing computational models), while also being practical
enough to be operational.

Below  we  enumerate  our  findings  thus  far  regarding  these  objectives. 


4  Scientific  Findings  and  Accomplishments 

We have at least partially realized aims 1 and 2, slightly ahead of our slated timeline. Specifically, as described
in detail in the following subsections, we have: (1) written code to scrape comments from reddit, a social-news
website that we use as our corpus; (2) built a web-based tool to facilitate annotation of these comments;
(3) assembled and trained a team of undergraduates to perform this annotation; and (4) analyzed the resultant
dataset. This analysis was summarized in our publication at this year’s meeting of the Association for Computational
Linguistics [7], the premier venue in natural language processing.

4.1  Introducing  the  reddit  Irony  Dataset 

Here we introduce the first version (β 1.0) of our irony corpus. Reddit (http://reddit.com) is a social-news
website to which news stories (and other links) are posted, voted on and commented upon. The forum
component of reddit is extremely active: popular posts often have thousands of user comments. Reddit
comprises ‘sub-reddits’, which focus on specific topics. For example, http://reddit.com/r/politics
[Screenshot of the annotation tool (‘IRONATOR’) running in a web browser. The example comment shown reads: “I agree! Politicians should not be reading jokes or email chains aloud to staff on business time!!” (comment identifier 1562).]

Figure 1: The web-based tool we built that was used by our annotators to label reddit comments. Enumerated
interface elements are as follows: (1) the text of the comment to be annotated, in which sentences marked
as ironic are highlighted; (2) buttons to label sentences as ironic or unironic; (3) buttons to request additional
context (the embedding discussion thread or associated webpage; see Section 4.1.2); (4) radio buttons to
provide confidence in comment labels (a Likert scale of low, medium and high).
sub-reddit (URL)                description                                          number of labeled comments
politics (r/politics)           Political news and editorials; focus on the US.      873
conservative (r/conservative)   A community for political conservatives.             573
progressive (r/progressive)     A community for political progressives (liberals).   543
atheism (r/atheism)             A community for non-believers.                       442
Christianity (r/Christianity)   News and viewpoints on the Christian faith.          312
technology (r/technology)       Technology news and commentary.                      277


Table 1: The six sub-reddits from which we have downloaded comments, and the number of comments from each
for which we have acquired annotations in this β version of the corpus. Note that we acquired labels at the sentence
level, whereas the counts above reflect comments, all of which contain at least one sentence.


features articles (and hence comments) centered around political news. The current version of the corpus is
available at: https://github.com/bwallace/ACL-2014-irony. The present version comprises 3,020 annotated
comments scraped from the six subreddits enumerated in Table 1. These comments in turn comprise
a total of 10,401 labeled sentences.²
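
The naive punctuation-based segmentation of comments into sentences can be sketched as follows. This is only an illustration under our own assumptions: the function name `naive_segment` and the exact splitting rule (break after `.`, `!` or `?` when followed by whitespace) are hypothetical and are not taken from the released code.

```python
import re


def naive_segment(comment: str) -> list[str]:
    """Split a comment into sentence-like chunks at terminal punctuation.

    Assumed rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    Abbreviations, URLs and ellipses are deliberately not handled.
    """
    pieces = re.split(r"(?<=[.!?])\s+", comment.strip())
    return [p for p in pieces if p]


example = ("Great idea on the talkathon Cruz. "
           "Really made the republicans look like the sane ones.")
for sentence in naive_segment(example):
    print(sentence)
```

A rule this simple will mis-split text containing abbreviations or URLs, which is one reason to treat the sentence counts as approximate.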

4.1.1  Annotation  Process 

Three Brown University undergraduates independently annotated each sentence in the corpus.³ More specifically,
annotators provided a binary ‘label’ for each sentence indicating whether or not they (the annotator)
believed the author intended it ironically. This annotation was facilitated via a
custom browser-based annotation tool built as part of this project, shown in Figure 1.

We intentionally did not provide much guidance to annotators regarding the criteria for what constitutes
an ‘ironic’ statement, for two reasons. First, verbal irony is a notoriously slippery concept [1], and coming
up with an operational definition that can be consistently applied is non-trivial. Second, we were interested in
assessing the extent of natural agreement between annotators for this task. The raw average agreement
between all annotators on all sentences is 0.844. Average pairwise Cohen’s kappa [2] is 0.341, suggesting
fair to moderate agreement [5], as we might expect for a subjective task like this one. Ideally we would
achieve better agreement, and we plan to revisit issues of annotator agreement in our future work
at UT Austin. At the very least, human agreement provides an upper bound on what we can expect from an
automated approach.
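
To make the two agreement statistics concrete, the sketch below computes raw agreement and Cohen's kappa for each annotator pair and averages them over pairs. The function names and the toy label vectors are our own illustrative assumptions, not the project's code or data.

```python
from itertools import combinations


def raw_agreement(a, b):
    """Fraction of items on which two annotators gave the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohen_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_observed = raw_agreement(a, b)
    # Chance agreement: each annotator labels independently according to
    # his or her own marginal label frequencies.
    p_chance = sum((a.count(k) / n) * (b.count(k) / n) for k in set(a) | set(b))
    if p_chance == 1.0:  # both annotators used one identical label throughout
        return 1.0
    return (p_observed - p_chance) / (1.0 - p_chance)


def average_pairwise(metric, annotations):
    """Average a two-annotator metric over all pairs of annotators."""
    pairs = list(combinations(annotations, 2))
    return sum(metric(a, b) for a, b in pairs) / len(pairs)


# Hypothetical sentence labels (1 = ironic, 0 = unironic) for three annotators.
toy_labels = [
    [1, 0, 0, 0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 0, 0, 0],
]
print("raw agreement:", average_pairwise(raw_agreement, toy_labels))
print("Cohen's kappa:", average_pairwise(cohen_kappa, toy_labels))
```

Because most sentences are unironic, annotators agree often by chance alone; this is why raw agreement can be high while kappa stays modest.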

4.1.2  Context 

Reddit  is  a  good  corpus  for  the  irony  detection  task  in  part  because  it  provides  a  natural  practical  realization 
of  the  otherwise  ill-defined  context  for  comments  (and  the  sentences  they  comprise).  In  particular,  each 
comment  is  associated  with  a  specific  user  (the  author),  and  we  can  view  their  previous  comments.  Moreover, 
comments  are  embedded  within  discussion  threads  that  pertain  to  the  (usually  external)  content  linked  to  in 
the  corresponding  submission  (see  Figure  [2]).  These  pieces  of  information  (previous  comments  by  the  same 
user,  the  external  link  of  the  embedding  reddit  thread,  and  the  other  comments  in  this  thread)  constitute 
our  context.  All  of  this  is  readily  accessible.  Labelers  can  opt  to  request  these  pieces  of  context  via  the 
annotation  tool,  and  we  record  when  they  do  so. 

Consider the following example comment taken from our dataset: “Great idea on the talkathon Cruz.
Really made the republicans look like the sane ones.” Did the author intend this statement sincerely, or
was this an ironic dig at Senator Ted Cruz? Without additional context it is difficult to know. And indeed,
all  three  annotators  requested  additional  context  for  this  comment.  This  context  at  first  suggests  that  the 
comment  may  have  been  intended  literally:  it  was  posted  in  the  r/conservative  subreddit  (Ted  Cruz  is  a 
conservative  senator).  But  if  we  peruse  the  author’s  comment  history,  we  see  that  he  or  she  repeatedly 

² We performed naive ‘segmentation’ of comments based on punctuation.

³ Additional undergraduates have since worked on this project, but only three annotators are included in our β 1.0 release
of the corpus.
[Screenshot of a reddit discussion thread (219 comments) under a submission linking to huffingtonpost.com. One of the visible comments reads: “The GOP outreach to American women continues right on track.”]


Figure  2:  An  illustrative  reddit  comment  (highlighted).  The  title  (“Virginia  Republican  ...”)  links  to  an 
article,  providing  one  example  of  contextualizing  content.  The  conversational  thread  in  which  this  comment 
is  embedded  provides  additional  context.  The  comment  in  question  was  presumably  intended  ironically, 
though  without  the  aforementioned  context  this  would  be  difficult  to  conclude  with  any  certainty.  Because 
all  of  this  information  is  readily  available  online,  we  think  automated  approaches  ought  to  try  and  exploit 
it. 


derides  Senator  Cruz  (e.g.,  writing  “Ted  Cruz  is  no  Ronald  Reagan.  They  aren’t  even  close.”).  From  this 
contextual  information,  then,  we  can  reasonably  assume  that  the  comment  was  intended  ironically  (and  all 
three  annotators  did  so  after  assessing  the  available  contextual  information). 

4.2  Humans  Need  Context  to  Infer  Irony 

We  explore  the  extent  to  which  human  annotators  rely  on  contextual  information  to  decide  whether  or  not 
sentences  were  intended  ironically.  Recall  that  our  annotation  tool  allows  labelers  to  request  additional 
context if they cannot make a decision based on the comment text alone (Figure 1). On average, annotators
requested additional context for 30% of comments (range across annotators of 12% to 56%). As shown in
Figure 3, annotators are consistently more confident once they have consulted this information.

We tested for a correlation between these requests for context and the final decisions regarding whether
comments contain at least one ironic sentence. We denote the probability of at least one annotator requesting
additional context for comment $i$ by $P(C_i)$. We then model the log-odds of this event as a linear function
of whether or not any annotator labeled any sentence in comment $i$ as ironic. We code this via the indicator
variable $\mathcal{I}_i$, which is 1 when comment $i$ has been deemed to contain an ironic sentence (by any of the three
annotators) and 0 otherwise.


$$\mathrm{logit}\{P(C_i)\} = \beta_0 + \beta_1 \mathcal{I}_i \qquad (1)$$

We used the regression model shown in Equation 1, where $\beta_0$ is an intercept and $\beta_1$ captures the
correlation between requests for context for a given comment and its ultimately being deemed to contain at
least one ironic sentence. We fit this model to the annotated corpus and found a significant correlation:
$\beta_1 = 1.508$ with a 95% confidence interval of (1.326, 1.690); $p < 0.001$.
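
For a logistic regression with a single binary predictor, as in Equation 1, the maximum-likelihood estimates have a closed form: the slope is the log odds ratio of the 2×2 table of (ironic, context requested) counts, and its Wald standard error is the square root of the summed reciprocal cell counts. The sketch below illustrates this; the function name and the cell counts are hypothetical and invented for illustration, and we do not claim this is how the reported fit was computed.

```python
import math


def fit_single_binary_logit(n11, n10, n01, n00, z=1.959963984540054):
    """Closed-form logistic fit for one binary predictor.

    n11: ironic comments with a context request
    n10: ironic comments without a context request
    n01: unironic comments with a context request
    n00: unironic comments without a context request
    Returns (intercept, slope, 95% Wald confidence interval for the slope).
    """
    beta0 = math.log(n01 / n00)                   # log-odds of a request when the indicator is 0
    beta1 = math.log((n11 * n00) / (n10 * n01))   # log odds ratio
    se1 = math.sqrt(1 / n11 + 1 / n10 + 1 / n01 + 1 / n00)
    return beta0, beta1, (beta1 - z * se1, beta1 + z * se1)


# Hypothetical counts, invented for illustration (not the corpus counts).
beta0, beta1, ci = fit_single_binary_logit(n11=300, n10=200, n01=600, n00=1900)
print(f"beta0={beta0:.3f}  beta1={beta1:.3f}  95% CI=({ci[0]:.3f}, {ci[1]:.3f})")
```

A positive slope whose interval excludes zero corresponds to context being requested more often for comments deemed ironic.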

In  other  words,  annotators  request  context  significantly  more  frequently  for  those  comments  that  (are 
ultimately  deemed  to)  contain  an  ironic  sentence.  This  would  suggest  that  the  words  and  punctuation 
comprising  online  comments  alone  are  not  sufficient  to  distinguish  ironic  from  unironic  comments.  Despite 
this,  most  machine  learning  based  approaches  to  irony  detection  have  relied  nearly  exclusively  on  such 
intrinsic  features. 
[Plot with three panels (annotator 1, annotator 2, annotator 3). Each panel connects the forced decision to the final decision for four sequences: ironic → ironic, ironic → unironic, unironic → unironic and unironic → ironic.]


Figure  3:  This  plot  illustrates  the  effect  of  viewing  contextual  information  for  three  annotators  (one  table 
for  each  annotator).  For  all  comments  for  which  these  annotators  requested  context,  we  show  forced  (before 
viewing the requested contextual content) and final (after) decisions regarding perceived ironic intent on
the part of the author. Each row shows one of four possible decision sequences (e.g., a judgement of ironic
prior  to  seeing  context  and  unironic  after).  Numbers  correspond  to  counts  of  these  sequences  for  each 
annotator  (e.g.,  the  first  annotator  changed  their  mind  from  ironic  to  unironic  86  times).  Cases  that  involve 
the  annotator  changing  his  or  her  mind  are  shown  in  red;  those  in  which  the  annotator  stuck  with  their 
initial  judgement  are  shown  in  blue.  Color  intensity  is  proportional  to  the  average  confidence  judgements 
the  annotator  provided:  these  are  uniformly  stronger  after  they  have  consulted  contextualizing  information. 
Note  also  that  the  context  frequently  results  in  annotators  changing  their  judgement. 

4.3  Machines Probably Do, Too

To  address  research  objective  2  above,  we  explored  whether  the  misclassifications  (with  respect  to  whether 
comments  contain  irony  or  not)  made  by  a  standard  text  classification  model  significantly  correlate  with 
those comments for which human annotators requested additional context. It turns out that they do. This
provides  evidence  that  bag-of-words  approaches  are  insufficient  for  the  general  task  of  irony  detection:  more 
context  is  necessary. 

Specifically, we implemented a baseline classification approach using vanilla token count features (binary
bag-of-words). We removed stop-words and limited the vocabulary to the 50,000 most frequently occurring
unigrams and bigrams. We added binary features coding for the presence of punctuation-based cues,
such as exclamation points, emoticons (for example, ‘;)’) and question marks: previous work [1] has found
that these are good indicators of ironic intent.
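
The feature encoding described above might look roughly like the following sketch. Everything specific here is an assumption made for illustration: the tiny stop-word list, the tokenizer, the emoticon pattern, the feature names, and the choice to rank vocabulary entries by the number of comments they appear in.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}
# Matches simple emoticons such as ;) :) :-( :D (assumed pattern).
EMOTICON = re.compile(r"[;:]-?[)(DP]")


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop-words."""
    return [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOP_WORDS]


def binary_features(text):
    """Set of features present in a comment (presence only, no counts)."""
    tokens = tokenize(text)
    feats = set(tokens)                                            # unigrams
    feats.update(f"{a} {b}" for a, b in zip(tokens, tokens[1:]))   # bigrams
    if "!" in text:
        feats.add("<HAS_EXCLAMATION>")
    if "?" in text:
        feats.add("<HAS_QUESTION>")
    if EMOTICON.search(text):
        feats.add("<HAS_EMOTICON>")
    return feats


def build_vocabulary(texts, max_size=50000):
    """Keep the max_size n-grams that occur in the most comments."""
    counts = Counter()
    for text in texts:
        counts.update(f for f in binary_features(text) if not f.startswith("<"))
    return {f for f, _ in counts.most_common(max_size)}


print(sorted(binary_features("Oh yeah, great idea!! ;)")))
```

Each comment then becomes a binary vector over the vocabulary plus the three punctuation flags, which is the input a linear classifier would consume.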

For our predictive model, we used a linear-kernel SVM (tuning the C parameter via grid-search over the
training dataset to maximize F1 score). We performed five-fold cross-validation, recording the predictions
$\hat{y}_i$ for each (held-out) comment $i$. Average F1 score over the five folds was 0.383 with range (0.330, 0.412);
mean recall was 0.496 (0.446, 0.548) and average precision was 0.315 (0.261, 0.380). The five most predictive
tokens were: !, yeah, guys, oh and shocked. This represents reasonable performance (and the high-ranking
tokens are as expected), but there is clearly considerable room for improvement.

We now explore empirically whether these misclassifications are made on the same comments for which
annotators requested context. To this end, we introduce a variable $\mathcal{M}_i$ for each comment $i$ such that $\mathcal{M}_i = 1$
if $\hat{y}_i \neq y_i$, i.e., $\mathcal{M}_i$ is an indicator variable that encodes whether or not the classifier misclassified comment
$i$. We then ran a second regression in which the output variable was the logit-transformed probability of the
model misclassifying comment $i$, i.e., $P(\mathcal{M}_i)$. Here we are interested in the correlation between the event that one
or more annotators requested additional context for comment $i$ (denoted by $C_i$) and model misclassifications
(adjusting for the comment’s true label). Formally:


$$\mathrm{logit}\{P(\mathcal{M}_i)\} = \theta_0 + \theta_1 \mathcal{I}_i + \theta_2 C_i \qquad (2)$$

Fitting this to the data, we estimated $\theta_2 = 0.930$ with a 95% CI of (0.769, 1.093); $p < 0.001$. Put another
way, the model is significantly more likely to make mistakes on those comments for which annotators requested
additional context (even after accounting for the annotator designation of comments).
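
As a reading aid, the coefficient on the context-request indicator can be converted to an odds ratio by exponentiating it. The figures below are our own arithmetic on the reported estimate and interval, rounded to two decimals; they are not numbers stated elsewhere in this report.

```latex
\[
\frac{\mathrm{odds}(\mathcal{M}_i = 1 \mid C_i = 1,\ \mathcal{I}_i)}
     {\mathrm{odds}(\mathcal{M}_i = 1 \mid C_i = 0,\ \mathcal{I}_i)}
  = e^{\theta_2} = e^{0.930} \approx 2.53,
\qquad
95\%\ \text{CI: } \left(e^{0.769},\ e^{1.093}\right) \approx (2.16,\ 2.98).
\]
```

That is, holding the irony label fixed, the estimated odds of a misclassification are roughly two and a half times higher for comments on which at least one annotator requested context.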


5  Progress Summary and Next Steps

Toward  realizing  our  first  objective,  we  have  collected  and  here  described  a  new,  publicly  available  corpus 
for  the  task  of  verbal  irony  detection  online.  The  data  comprises  comments  scraped  from  the  social  news 
website  reddit.  We  recorded  confidence  judgements  and  requests  for  contextualizing  information  for  each 
comment  during  annotation.  We  have  analyzed  this  corpus  to  provide  empirical  evidence  that  annotators 
quite often require context beyond the comment under consideration to discern irony, especially for those
comments ultimately deemed to have been intended ironically.

Regarding  our  second  objective,  we  have  demonstrated  that  a  standard  token-based  machine  learning 
approach misclassified many of the same comments for which annotators tend to request context. Indeed,
we  have  shown  that  annotators  rely  on  contextual  cues  (in  addition  to  word  and  grammatical  features)  to 
discern  irony  and  have  argued  that  this  implies  computers  should,  too. 

The  obvious  next  step  (toward  realizing  our  third  objective)  is  to  develop  new  machine  learning  models 
that  exploit  the  contextual  information  available  in  the  corpus  we  have  curated  (e.g.,  previous  comments  by 
the  same  user,  the  thread  topic). 

We  are  now  working  on  this  at  UT  Austin.  For  example,  we  have  developed  a  method  that  exploits 
the  user-community  (sub-reddit)  to  which  a  comment  was  posted  to  improve  classification  performance.  We 
currently  have  a  paper  under  review  for  potential  presentation  at  the  2015  annual  meeting  of  the  Association 
for  Computational  Linguistics  (ACL)  describing  this  approach.  We  are  also  working  on  related  problems  of 
regularization  in  classification  models,  as  we  have  discovered  this  to  be  particularly  important  for  the  task 
of  irony  detection.  Finally,  we  are  now  collecting  additional  data  from  Twitter  to  be  manually  labeled  to  see 
if  similar  approaches  are  effective  on  this  kind  of  data. 